Landscape of species gene trees topologies inferred with 62 plastid protein-coding genes

Here we present the code used to calculate the vectors of RF and KC distances in two different ways:

  1. We look at the species trees landscape. In our study we perfomed species tree searches using 78 plastid protein-coding genes of 89 species of rosids (and 5 outgroups) in four different alignments:

gene (CDS+introns, nucleotide alignment),
exon (CDS, nucleotide alignment),
codon (CDS, codon alignment), and
amino acid (CDS, alignment of translated nucleotides)

The phylogenetic analysis were performed with these four matrices under Maximum likelihood (IQ-TREE) with different schemes of data partition (unpartitioned, partitioned by gene, and best scheme of partition) and the Multispecies Coalescent (MSC) method using a summary method (ASTRAL-II) and a complete search (SVDquartets).

  1. Then we explored the landscape of species trees compared to the distribution of gene tree topologies for each of our four matrices. Gene tree topologies were inferred using ML for each gene in separate runs for each of our four matrices. Here, we present the distance of gene trees representing 62 genes* to the species trees. Note that the species trees were inferred with 78 genes.

Along this tutorial, we use the following acronyms: gene - ge exon - ex codon - cd amino acid - aa IQ-TREE - iq ASTRAL-II - as SVDQuartets - sv concatenated - c gene partition - gp best schme of partition - bs

First, upload the following packages:

Treespace package requires all trees to have the same terminals. In other words, if you have species (i.e., terminals) that lost one or a few genes, you will have to remove those species or those gene from the analysis. Here we removed three species that had a high number of missing data and 16 genes.

Tip: We used sumtrees.py to suppress the annotations and to root the trees. We need root the trees now to use it in KC. However, we will have to unroot them for RF part.

#First, using sumtrees.py, remove all the annotations of the tree
#for file in *.tre; do sumtrees.py -F newick --suppress-annotations -p -d 0 --root-target-at-outgroup "Apiales_Araliaceae_Aralia_undulata" -o ../output_nwk/${file//.tre/.nwk} $file; done

After we suppressed the annotations, we preparee a nexus file with all the trees and labelled them accordingly.

Setting the working directory before start.

setwd("/Users/deisejpg/Documents/1_Plastid_genome/phyloanalysis/treespace/update_feb2018")
getwd()
## [1] "/Users/deisejpg/Documents/1_Plastid_genome/phyloanalysis/treespace/update_feb2018"

Next, we removed three species from the dataset so all the trees have the same terminals. Here we also rooted the trees, needed for KC metric.

#Prepare a nexus file with the trees that you want to drop the tips
#trees_droptip <- read.nexus("./input_prep/dropping_tips.nex")

#As I have multiple trees in a single nexus file, I am using lapply in addition to drop.tip, so the three species indicated are removed from all trees
#trees_droptip <- lapply(trees_droptip, drop.tip, c("Caryophyllales_Caryophyllaceae_Silene_capitata",     "Myrtales_Myrtaceae_Xanthostemon_chrysanthus", "Fabales_Polygalaceae_Polygala_alba"))

#Then I verify the class of my tree object and I assign it as a "multiphylo" object, meaning that it has multiple trees
#class(trees_droptip)
#class(trees_droptip) <- "multiPhylo"
#class(trees_droptip)

#Saving a file with rooted trees without the three terminals that we dropped
#write.tree(trees_droptip, file="./input_prep/tipsdropped_rooted.nwk", d=0)
#To make sure my trees are unrooted (necessary for RF) I use unroot() and then I save the new trees in a newick file
#trees_droptip <- unroot(trees_droptip)
#write.tree(trees_droptip, file="./input_prep/tipsdropped_unrooted.nwk", d=0)

Then we opened the file, and prepared a new nexus file with all species trees from all of our analysis. We labelled the trees with information about method, matrix, and partitioned scheme (for IQ-TREE).

1. Species Tree Landscape

1.1. Robinson-Foulds (unrooted trees)

Distance between species trees inferred using MSC and ML (concatenated and partitioned by gene and by best scheme):

#reading the nexus file with all the trees and confirming they are unrooted
tree_allspp_nrtd <- read.nexus("./input_unrooted/allsptrees_tipsdropped_unrooted.nex")
tree_allspp_nrtd
## 19 phylogenetic trees
#class(tree_allspp_nrtd)
is.rooted.multiPhylo(tree_allspp_nrtd)
##    as_ge    as_ex    as_cd    as_aa    sv_ge    sv_ex    sv_cd  iq_ge_c 
##    FALSE    FALSE    FALSE    FALSE    FALSE    FALSE    FALSE    FALSE 
##  iq_ex_c  iq_cd_c  iq_aa_c iq_ge_gp iq_ex_gp iq_cd_gp iq_aa_gp iq_ge_bs 
##    FALSE    FALSE    FALSE    FALSE    FALSE    FALSE    FALSE    FALSE 
## iq_ex_bs iq_cd_bs iq_aa_bs 
##    FALSE    FALSE    FALSE

Getting the names of the trees:

tree_names_allspp_nrtd <- names(tree_allspp_nrtd)
tree_names_allspp_nrtd
##  [1] "as_ge"    "as_ex"    "as_cd"    "as_aa"    "sv_ge"    "sv_ex"   
##  [7] "sv_cd"    "iq_ge_c"  "iq_ex_c"  "iq_cd_c"  "iq_aa_c"  "iq_ge_gp"
## [13] "iq_ex_gp" "iq_cd_gp" "iq_aa_gp" "iq_ge_bs" "iq_ex_bs" "iq_cd_bs"
## [19] "iq_aa_bs"

Now we calculate the distances between the trees using “RF” for unweighted Robinson-Foulds distance:

treespace_allspp_nrtd <- treespace(tree_allspp_nrtd, method = "RF", nf = 18)
## Warning in is.euclid(distmat): Zero distance(s)
#If you see "Zero distance(s)" you can check below and look for the tree(s) that are exactly the same
#head(treespace_allspp_nrtd)
#Here, iq_ex_c distance fom iq_ex_bs is 0, meaning that these two trees have no differences

#Now you can see the values of the PCoA axis
#head(treespace_allspp_nrtd$pco$li)

We use the round() and sum() to verify the percentage explained by each axis in the ordination:

#Eigen values are measurements of variance, so to calculate the percentage explained by each axis, just divide the eigen values  by the sum of all eigen values
#Number three indicates three digits
round(treespace_allspp_nrtd$pco$eig/sum(treespace_allspp_nrtd$pco$eig),3)
##  [1] 0.345 0.202 0.115 0.083 0.069 0.054 0.053 0.029 0.014 0.013 0.012
## [12] 0.005 0.003 0.002 0.001
#Axis1 = 0.345, Axis2 = 0.202, Axis3 = 0.115; Axis4 = 0.083, Axis5 = 0.069

Then we find the clusters or trees using unweighted Robinson-Foulds:

#Since this step is interactive, I used the lines below to determine that the number of clusters is 2
#id_cluster_nrtd <- findGroves(treespace_allspp_nrtd, method = "RF")
#"Please define a cutoff point:""
#40
#y

#Now I run the same code defining "nclust = 2"
id_cluster_nrtd <- findGroves(treespace_allspp_nrtd, method = "RF", nclust = 2)

Then we plot again to check:

plotGrovesD3(id_cluster_nrtd, treeNames = tree_names_allspp_nrtd)
#head(id_cluster_nrtd)

Now we include the information about the software utilized in the phylogenetic inference:

methods_as_sv_iq <- read.csv("allsptrees.csv", h=F)
class(methods_as_sv_iq)
## [1] "data.frame"
#head(methods_as_sv_iq)

We then gathered vectors with usefull information and added in a new dataframe which includes: the axis of the ordination calculated by treespace(), cluster information about the methods, and the names of the trees, and groups with groups that clustered, calculated with findGroves().

shorter_names <- c("ge", "ex", "cd", "aa", "ge", "ex", "cd", "ge_c", "ex_c", "cd_c", "aa_c", "ge_gp", "ex_gp", "cd_gp", "aa_gp", "ge_bs", "ex_bs", "cd_bs", "aa_bs")
df_nrtd <- data.frame(id_cluster_nrtd$treespace$pco$li, dataset=tree_names_allspp_nrtd,
                      Method=methods_as_sv_iq$V1, group=id_cluster_nrtd$groups,
                      name=rownames(id_cluster_nrtd$treespace$pco$li), Shorter_names=shorter_names)
#head(df_nrtd)

Now we included a last column with information whether the tree was inferred using MCM or ML method:

df_nrtd["coalesc_x_concat"]<- c("mcm", "mcm", "mcm", "mcm", "mcm", "mcm", "mcm", "iqtree", "iqtree", "iqtree", "iqtree", "iqtree", "iqtree", "iqtree", "iqtree", "iqtree", "iqtree", "iqtree", "iqtree")
class(df_nrtd$coalesc_x_concat)
## [1] "character"
#head(df_nrtd)

Finally, we gathered the level of the three software utilized and reordered the “coalesc_x_concat” column so we have data from astral and svdq in sequence.

df_nrtd$Method <- factor(df_nrtd$Method, levels = as.ordered(levels(df_nrtd$Method)))
df_nrtd_arranged <- arrange(df_nrtd, desc(coalesc_x_concat))
#head(df_nrtd_arranged)

Now we plot the Robinson-Foulds graph.

#This is to put in order: "astral, svdq, iqtree"
df_nrtd$Method <- factor(df_nrtd$Method, levels(df_nrtd$Method)[c(1,3,2)]) 

#Now ploting the species tree landscape
plot1_RF <- ggplot(df_nrtd, aes(x=A1, y=A2, size=3, label=Shorter_names, color=Method, shape=Method)) + 
  geom_point(alpha=1) + 
  guides(size=F) + geom_hline(yintercept = 0, linetype="dashed") +
  geom_vline(xintercept = 0, linetype="dashed") +
  geom_text_repel(size=6, box.padding=1.5, point.padding = 0.8, show.legend=FALSE) +
  labs(x="PCoA1 (34.5%)", y=" PCoA2 (20.2%)") + 
  theme_classic() + ggtitle("Robinson-Foulds") +
  theme(legend.text=element_text(size=20),legend.title=element_text(size=20), plot.title = element_text(size=20, hjust = 0.5)) +
  guides(colour = guide_legend(override.aes = list(size=5))) +
  theme(axis.title=element_text(size = 18,face="bold"), axis.text=element_text(size=14,face="bold")) +
  scale_colour_manual(values = c("#000000", "#E69F00", "#56B4E9"))+
  scale_shape_manual(values=c(17, 17, 19)) 
plot1_RF

1.2. Kendall-Colijn (rooted trees)

Distance between species trees inferred using MCM and ML (concatenated and partitioned by gene and by best scheme):

trees_allspp_rtd <- read.nexus("./input_rooted/allsptrees_tipsdropped_rooted.nex")
trees_allspp_rtd
## 19 phylogenetic trees
trees_allspp_rtd <- root(trees_allspp_rtd, "Apiales_Araliaceae_Aralia_undulata", resolve.root = TRUE)
class(trees_allspp_rtd)
## [1] "multiPhylo"
is.rooted.multiPhylo(trees_allspp_rtd)
##    as_ge    as_ex    as_cd    as_aa    sv_ge    sv_ex    sv_cd  iq_ge_c 
##     TRUE     TRUE     TRUE     TRUE     TRUE     TRUE     TRUE     TRUE 
##  iq_ex_c  iq_cd_c  iq_aa_c iq_ge_gp iq_ex_gp iq_cd_gp iq_aa_gp iq_ge_bs 
##     TRUE     TRUE     TRUE     TRUE     TRUE     TRUE     TRUE     TRUE 
## iq_ex_bs iq_cd_bs iq_aa_bs 
##     TRUE     TRUE     TRUE
trees_allspp_rtd_names <- names(trees_allspp_rtd)
trees_allspp_rtd_names
##  [1] "as_ge"    "as_ex"    "as_cd"    "as_aa"    "sv_ge"    "sv_ex"   
##  [7] "sv_cd"    "iq_ge_c"  "iq_ex_c"  "iq_cd_c"  "iq_aa_c"  "iq_ge_gp"
## [13] "iq_ex_gp" "iq_cd_gp" "iq_aa_gp" "iq_ge_bs" "iq_ex_bs" "iq_cd_bs"
## [19] "iq_aa_bs"

Now we calculate the distances between the trees using KC, the default of treespace():

#Calculating the treespace
treespace_allspp_rtd <- treespace(trees_allspp_rtd, nf=18)
#head(treespace_allspp_rtd)
#head(treespace_allspp_rtd$pco$li)

We use the round() and sum() to verify the percentage explained by each axis in the ordination:

round(treespace_allspp_rtd$pco$eig/sum(treespace_allspp_rtd$pco$eig),3)
##  [1] 0.972 0.015 0.010 0.001 0.001 0.000 0.000 0.000 0.000 0.000 0.000
## [12] 0.000 0.000
#Axis1 = 97.2%, Axis2 = 1.5%, Axis3 = 1.0%

Then we find the clusters or trees using Kendall-Colijn metric:

id_cluster_rtd <- findGroves(treespace_allspp_rtd, method = "treeVec", nclust = 3)
#125
#y

Then we plot again to check:

plotGrovesD3(id_cluster_rtd, treeNames = trees_allspp_rtd_names)

Now we include the information about the software utilized in the phylogenetic inferrence:

methods_as_sv_iq <- read.csv("allsptrees.csv", h=F)
class(methods_as_sv_iq)
## [1] "data.frame"
#head(methods_as_sv_iq)

Note that the object above is a datafram, so “V1” is a vector and to gather its information we had to use “methods_as_sv_iq$V1” in the code below.

And then we prepared a dataframe with all the information that we will need to plot the data:

shorter_names <- c("ge", "ex", "cd", "aa", "ge", "ex", "cd", "ge_c", "ex_c", "cd_c", "aa_c", "ge_gp", "ex_gp", "cd_gp", "aa_gp", "ge_bs", "ex_bs", "cd_bs", "aa_bs")
df_rtd <- data.frame(id_cluster_rtd$treespace$pco$li, dataset=trees_allspp_rtd_names,
                     Method=methods_as_sv_iq$V1, group=id_cluster_rtd$groups,
                     name=rownames(id_cluster_rtd$treespace$pco$li), shorter_names=shorter_names)
#head(df_rtd)
df_rtd["coalesc_x_concat"]<- c("mcm", "mcm", "mcm", "mcm", "mcm", "mcm", "mcm", "iqtree", "iqtree", "iqtree", "iqtree", "iqtree", "iqtree", "iqtree", "iqtree", "iqtree", "iqtree", "iqtree", "iqtree")
class(df_rtd$coalesc_x_concat)
## [1] "character"
#head(df_rtd)
df_rtd$Method <- factor(df_rtd$Method, levels = as.ordered(levels(df_rtd$Method)))
df_rtd_arranged <- arrange(df_rtd, desc(df_rtd$coalesc_x_concat))
#head(df_rtd_arranged)
#change order to avoid overlapping
df2_rtd_arranged <- arrange(df_rtd_arranged, desc(Method))
#head(df2_rtd_arranged)
df2_rtd_arranged$Method <- factor(df2_rtd_arranged$Method, levels(df2_rtd_arranged$Method)[c(1,3,2)]) 

plot2_KC <- ggplot(df2_rtd_arranged, aes(x=A1, y=A2, color=Method, size=4, label=shorter_names, shape=Method)) +
  geom_point(alpha=1) + 
  guides(size=F) + geom_hline(yintercept = 0, linetype="dashed") +
  geom_vline(xintercept = 0, linetype="dashed") +
  geom_text_repel(size=6, box.padding=1.5, point.padding = 0.8, show.legend=FALSE) +
  labs(x="PCoA1 (97.2%)", y=" PCoA2 (1.5%)") + 
  theme_classic() + ggtitle("Kendall-Colijn") +
  theme(legend.text=element_text(size=20),legend.title=element_text(size=20), 
        plot.title = element_text(size=20, hjust = 0.5)) +
  guides(colour = guide_legend(override.aes = list(size=5))) +
  theme(axis.title=element_text(size = 18,face="bold"), axis.text=element_text(size=14,face="bold")) +
  scale_colour_manual(values = c("#000000", "#E69F00", "#56B4E9"))+
  scale_shape_manual(values=c(17, 17, 19))
plot2_KC

Now we combine both graphs using plot_grid() First, create a legend for plot1_RF, then combine both graphs without a legend, and finalys adding a legend:

#Here we create an object that contains the legend that will be used in the grid to combine two graphs
legend <- get_legend(plot1_RF + theme(legend.position="bottom"))

two_plots <- plot_grid(plot1_RF + theme(legend.position="none"), plot2_KC +
                      theme(legend.position="none"), 
                      align = 'vh',
                      labels = c("a)", "b)"), 
                      hjust = -1, nrow = 2) 
two_plots

#Now adding the legend
two_plots_onelegend <- plot_grid(two_plots, legend, rel_heights = c(6, .5), hjust = 1, nrow = 2)
#two_plots_onelegend

#Finally adding a title
title <- ggdraw() + draw_label("Species Trees Landscape", fontface='bold', size = 18)
sp_tree_plot <- plot_grid(title, two_plots_onelegend, ncol=1, rel_heights=c(.1, 1.5), rel_widths = .1)
sp_tree_plot

#Saving pdf determining the size of the figure:
save_plot("./Figure5.pdf", sp_tree_plot, base_width=6, base_height=12)

Now let’s look at species trees versus gene trees, using RF and KC metric for each dataset

2. Gene trees versus species trees

2.1. Gene matrix

2.1.1. RF (gene matrix)

#tree <- read.nexus("genetrees.nex")
#first let's confirm we dropped the tips of species that are not present in all trees and let's unroot the tree
#tree_gerf <- lapply(tree, drop.tip, c("Caryophyllales_Caryophyllaceae_Silene_capitata", "Myrtales_Myrtaceae_Xanthostemon_chrysanthus", "Fabales_Polygalaceae_Polygala_polygama"))
#class(tree)
#class(tree) <- "multiPhylo"
#class(tree)
#tree <- unroot(tree)
#write.tree(tree, file="genetrees_unrooted.nwk", d=0)
tree_gerf <- read.nexus("./input_unrooted/all_spgenetree_ge_rf.nex")
class(tree_gerf)
## [1] "multiPhylo"
tree_names_gerf <- names(tree_gerf)
tree_names_gerf
##  [1] "astral"        "svdq"          "concatenated"  "genepartition"
##  [5] "bestscheme"    "atpA"          "atpB"          "atpE"         
##  [9] "atpF"          "atpH"          "atpI"          "cemA"         
## [13] "matK"          "ndhA"          "ndhB"          "ndhC"         
## [17] "ndhD"          "ndhE"          "ndhF"          "ndhG"         
## [21] "ndhI"          "ndhJ"          "petB"          "petD"         
## [25] "petG"          "petL"          "petN"          "psaA"         
## [29] "psaB"          "psaC"          "psaI"          "psaJ"         
## [33] "psbA"          "psbB"          "psbC"          "psbD"         
## [37] "psbE"          "psbF"          "psbH"          "psbI"         
## [41] "psbJ"          "psbK"          "psbM"          "psbN"         
## [45] "psbT"          "psbZ"          "rbcL"          "rpl14"        
## [49] "rpl20"         "rpl23"         "rpl2"          "rpl33"        
## [53] "rpl36"         "rpoB"          "rpoC1"         "rpoC2"        
## [57] "rps11"         "rps12"         "rps14"         "rps15"        
## [61] "rps18"         "rps2"          "rps3"          "rps4"         
## [65] "rps7"          "rps8"          "ycf3"
is.rooted.multiPhylo(tree_allspp_nrtd)
##    as_ge    as_ex    as_cd    as_aa    sv_ge    sv_ex    sv_cd  iq_ge_c 
##    FALSE    FALSE    FALSE    FALSE    FALSE    FALSE    FALSE    FALSE 
##  iq_ex_c  iq_cd_c  iq_aa_c iq_ge_gp iq_ex_gp iq_cd_gp iq_aa_gp iq_ge_bs 
##    FALSE    FALSE    FALSE    FALSE    FALSE    FALSE    FALSE    FALSE 
## iq_ex_bs iq_cd_bs iq_aa_bs 
##    FALSE    FALSE    FALSE
method_gerf <- read.csv("all_spgenetree.csv", h=F)
class(method_gerf)
## [1] "data.frame"
#Calculating the distances
treespace_gerf <- treespace(tree_gerf,  method = "RF", nf=66)
#head(treespace_gerf)
#head(treespace_gerf$pco$li)
plotGrovesD3(treespace_gerf$pc, treeNames = tree_names_gerf)
#Detecting the number of groups
#id_cluster_gerf <- findGroves(treespace_gerf, method = "RF")
#250
#y

id_cluster_gerf <- findGroves(treespace_gerf, method = "RF", nclust = 2)
plotGrovesD3(id_cluster_gerf, treeNames = tree_names_gerf)
#making a dataframe with the coordinates of the trees calculated by treespace(), cluster information, and rownmes 
#for names of trees; group is not necessary because I have already calculated the number of groups with findGroves()
df_gerf <- data.frame(id_cluster_gerf$treespace$pco$li, dataset=tree_names_gerf, Method=method_gerf$V1,
                      group=id_cluster_gerf$groups, name=rownames(id_cluster_gerf$treespace$pco$li))
#head(df_gerf)
#The names of all gene trees
#df_gerf["sptree_vs_getree_1"] <- c("as", "sv", "c", "gp", "bs", "atpA", "atpB", "atpE", "atpF", "atpH", "atpI", "cemA", "matK", "ndhA", "ndhB", "ndhC", "ndhD", "ndhE", "ndhF", "ndhG", "ndhI", "ndhJ", "petB", "petD", "petG", "petL", "petN", "psaA", "psaB", "psaC", "psaI", "psaJ", "psbA", "psbB", "psbC", "psbD", "psbE", "psbF", "psbH", "psbI", "psbJ ", "psbK ", "psbM", "psbN", "psbT", "psbZ", "rbcL", "rpl14", "rpl20", "rpl23", "rpl2", "rpl33", "rpl36", "rpoB", "rpoC1", "rpoC2", "rps11", "rps12", "rps14", "rps15", "rps18", "rps2", "rps3", "rps4", "rps7", "rps8", "ycf3" )
#head(df_gerf)
df_gerf["sptree_vs_getree"] <- c("as", "sv", "c", "gp", "bs", NA, NA, NA, NA, NA, NA, NA, "matK", NA, NA, NA, NA, NA, "ndhF", NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, "rbcL", NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA)
#head(df_gerf)
#change order to avoid overlapping
df2_ge_arrangedrf <- arrange(df_gerf, desc(Method))
#head(df2_ge_arrangedrf)

#getting the name for the legend
#colnames(df2_ge_arrangedrf$Method)
#We use the round() and sum() to verify the percentage explained by each axis in the ordination
round(treespace_gerf$pco$eig/sum(treespace_gerf$pco$eig),3)
##  [1] 0.157 0.049 0.042 0.041 0.040 0.039 0.037 0.036 0.035 0.033 0.031
## [12] 0.029 0.028 0.027 0.026 0.024 0.023 0.022 0.020 0.019 0.018 0.016
## [23] 0.015 0.014 0.014 0.012 0.011 0.011 0.011 0.009 0.009 0.008 0.008
## [34] 0.007 0.007 0.006 0.006 0.006 0.005 0.005 0.004 0.004 0.004 0.004
## [45] 0.003 0.003 0.002 0.002 0.002 0.002 0.002 0.001 0.001 0.001 0.001
## [56] 0.001 0.001 0.001 0.001 0.001 0.000 0.000 0.000 0.000 0.000
#Axis1 = 0.157, Axis2 = 0.049, Axis3 = 0.042, Axis4 = 0.041, Axis5 = 0.040
#Now we plot the graph
df2_ge_arrangedrf$Method <- factor(df2_ge_arrangedrf$Method, levels(df2_ge_arrangedrf$Method)[c(1,5,4,2,3,6)])
plot_RF_ge <- ggplot(df2_ge_arrangedrf, aes(x=A1, y=A2, color = Method, label = sptree_vs_getree)) +
  geom_point(aes(color=Method, shape=Method),alpha=.8, size=4) +
  guides(size=F) + geom_hline(yintercept = 0, linetype="dashed") +
  geom_vline(xintercept = 0, linetype="dashed") +
  geom_text_repel(size=6, box.padding=2, point.padding = 0.8, show.legend=FALSE) +
  labs(x="PCoA1 (15.7%)", y=" PCoA2 (4.9%)") + 
  theme_classic() + ggtitle("Robinson-Foulds - gene") +
  theme(legend.text=element_text(size=12),legend.title=element_text(size=15),
        plot.title = element_text(hjust = 0.5)) +
  guides(colour = guide_legend(override.aes = list(size=5))) +
  theme(axis.title=element_text(size = 12,face="bold"), axis.text=element_text(size=10,face="bold"))  +
  scale_colour_manual(values = c("#000000", "#E69F00", "#CC79A7", "#009E73", "#F0E442", "#56B4E9"))+
  scale_shape_manual(values=c(17, 17, 19, 19, 19, 19))
plot_RF_ge
## Warning: Removed 59 rows containing missing values (geom_text_repel).

2.1.2. KC (gene matrix)

tree_gekc <- read.nexus("./input_rooted/all_spgenetree_ge_kc.nex")
tree_gekc
## 67 phylogenetic trees
tree_gekc <- root(tree_gekc, "Apiales_Araliaceae_Aralia_undulata", resolve.root = TRUE)
class(tree_gekc)
## [1] "multiPhylo"
is.rooted.multiPhylo(tree_gekc)
##           astral             svdq  iq_concatenated iq_genepartition 
##             TRUE             TRUE             TRUE             TRUE 
##    iq_bestscheme             atpA             atpB             atpE 
##             TRUE             TRUE             TRUE             TRUE 
##             atpF             atpH             atpI             cemA 
##             TRUE             TRUE             TRUE             TRUE 
##             matK             ndhA             ndhB             ndhC 
##             TRUE             TRUE             TRUE             TRUE 
##             ndhD             ndhE             ndhF             ndhG 
##             TRUE             TRUE             TRUE             TRUE 
##             ndhI             ndhJ             petB             petD 
##             TRUE             TRUE             TRUE             TRUE 
##             petG             petL             petN             psaA 
##             TRUE             TRUE             TRUE             TRUE 
##             psaB             psaC             psaI             psaJ 
##             TRUE             TRUE             TRUE             TRUE 
##             psbA             psbB             psbC             psbD 
##             TRUE             TRUE             TRUE             TRUE 
##             psbE             psbF             psbH             psbI 
##             TRUE             TRUE             TRUE             TRUE 
##             psbJ             psbK             psbM             psbN 
##             TRUE             TRUE             TRUE             TRUE 
##             psbT             psbZ             rbcL            rpl14 
##             TRUE             TRUE             TRUE             TRUE 
##            rpl20            rpl23             rpl2            rpl33 
##             TRUE             TRUE             TRUE             TRUE 
##            rpl36             rpoB            rpoC1            rpoC2 
##             TRUE             TRUE             TRUE             TRUE 
##            rps11            rps12            rps14            rps15 
##             TRUE             TRUE             TRUE             TRUE 
##            rps18             rps2             rps3             rps4 
##             TRUE             TRUE             TRUE             TRUE 
##             rps7             rps8             ycf3 
##             TRUE             TRUE             TRUE
tree_names_gekc <- names(tree_gekc)
tree_names_gekc
##  [1] "astral"           "svdq"             "iq_concatenated" 
##  [4] "iq_genepartition" "iq_bestscheme"    "atpA"            
##  [7] "atpB"             "atpE"             "atpF"            
## [10] "atpH"             "atpI"             "cemA"            
## [13] "matK"             "ndhA"             "ndhB"            
## [16] "ndhC"             "ndhD"             "ndhE"            
## [19] "ndhF"             "ndhG"             "ndhI"            
## [22] "ndhJ"             "petB"             "petD"            
## [25] "petG"             "petL"             "petN"            
## [28] "psaA"             "psaB"             "psaC"            
## [31] "psaI"             "psaJ"             "psbA"            
## [34] "psbB"             "psbC"             "psbD"            
## [37] "psbE"             "psbF"             "psbH"            
## [40] "psbI"             "psbJ"             "psbK"            
## [43] "psbM"             "psbN"             "psbT"            
## [46] "psbZ"             "rbcL"             "rpl14"           
## [49] "rpl20"            "rpl23"            "rpl2"            
## [52] "rpl33"            "rpl36"            "rpoB"            
## [55] "rpoC1"            "rpoC2"            "rps11"           
## [58] "rps12"            "rps14"            "rps15"           
## [61] "rps18"            "rps2"             "rps3"            
## [64] "rps4"             "rps7"             "rps8"            
## [67] "ycf3"
#This is to identify metadata
method_gekc <- read.csv("all_spgenetree.csv", h=F)
#Calculating the treespace
treespace_gekc <- treespace(tree_gekc, nf=66)
#head(treespace_gekc)
#head(treespace_gekc$pco$li)
plotGrovesD3(treespace_gekc$pc, treeNames = tree_names_gekc)
#Detecting the number of groups
#id_cluster_gekc <- findGroves(treespace_gekc)
#600
#y

id_cluster_gekc <- findGroves(treespace_gekc, nclust = 2)
plotGrovesD3(id_cluster_gekc)
#making a dataframe with the coordinates of the trees calculated by treespace(), cluster information, and rownmes 
#for names of trees; group is not necessary because I have already calculated the number of groups with findGroves()
df_gekc <- data.frame(id_cluster_gekc$treespace$pco$li, dataset=tree_names_gekc, Method=method_gekc$V1,
                    group=id_cluster_gekc$groups, name=rownames(id_cluster_gekc$treespace$pco$li))
#head(df_gekc)
df_gekc["sptree_vs_getree"] <- c("as", "sv", "c", "gp", "bs", NA, NA, NA, NA, NA, NA, NA, "matK", NA, NA, NA, NA, NA, "ndhF", NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, "rbcL", NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA)
#head(df_gekc)
#change order to avoid overlapping
df2_ge_arrangedkc <- arrange(df_gekc, desc(Method))
#head(df2_ge_arrangedkc)
#We use the round() and sum() to verify the percentage explained by each axis in the ordination
round(treespace_gekc$pco$eig/sum(treespace_gekc$pco$eig),3)
##  [1] 0.515 0.082 0.060 0.048 0.032 0.027 0.025 0.023 0.021 0.017 0.015
## [12] 0.014 0.013 0.012 0.010 0.010 0.008 0.007 0.007 0.006 0.004 0.004
## [23] 0.004 0.003 0.003 0.003 0.003 0.002 0.002 0.002 0.002 0.002 0.001
## [34] 0.001 0.001 0.001 0.001 0.001 0.001 0.001 0.000 0.000 0.000 0.000
## [45] 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000
## [56] 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000
#Axis1 = 0.515, Axis2 = 0.082, Axis3 = 0.060, Axis4 = 0.048, Axis5 = 0.032
#Now we plot the graph
df2_ge_arrangedkc$Method <- factor(df2_ge_arrangedkc$Method, levels(df2_ge_arrangedkc$Method)[c(1,2,3,6,5,4)])

plot_KC_ge <- ggplot(df2_ge_arrangedkc, aes(x=A1, y=A2, color = Method, label = sptree_vs_getree)) +
  geom_point(aes(color=Method, shape=Method), alpha=.8, size=4) + 
  guides(size=F) + geom_hline(yintercept = 0, linetype="dashed") +
  geom_vline(xintercept = 0, linetype="dashed") +
  geom_text_repel(size=6, box.padding=2, point.padding = 0.9, show.legend=FALSE) +
  labs(x="PCoA1 (51.5%)", y=" PCoA2 (8.2%)") + 
  theme_classic() + ggtitle("Kendall-Colijn - gene") +
  theme(legend.text=element_text(size=12),legend.title=element_text(size=15),
        plot.title = element_text(hjust = 0.5)) +
  guides(colour = guide_legend(override.aes = list(size=5))) +
  theme(axis.title=element_text(size = 12,face="bold"), axis.text=element_text(size=10,face="bold"))  +
  scale_colour_manual(values = c("#000000", "#E69F00", "#CC79A7", "#009E73", "#F0E442", "#56B4E9"))+
  scale_shape_manual(values=c(17, 17, 19, 19, 19, 19))
plot_KC_ge
## Warning: Removed 59 rows containing missing values (geom_text_repel).

2.2. Amino Acid matrix

2.2.1. RF (amino acid matrix)

setwd("/Users/deisejpg/Documents/1_Plastid_genome/phyloanalysis/treespace/update_feb2018/")
getwd()
## [1] "/Users/deisejpg/Documents/1_Plastid_genome/phyloanalysis/treespace/update_feb2018"
#first let's confirm we dropped the tips of species that are not present in all trees and let's unroot the tree
#tree_aarf <- lapply(tree_aarf, drop.tip, c("Caryophyllales_Caryophyllaceae_Silene_capitata",
#                                           "Myrtales_Myrtaceae_Xanthostemon_chrysanthus",
#                                           "Fabales_Polygalaceae_Polygala_polygama"))
#class(tree_aarf)
#class(tree_aarf) <- "multiPhylo"
#class(tree_aarf)
#tree_aarf <- unroot(tree_aarf)
#write.tree(tree_aarf, file="genetrees_aa_unrooted.nwk", d=0)
tree_aarf <- read.nexus("./input_unrooted/all_spgenetree_aa_rf.nex")
class(tree_aarf)
## [1] "multiPhylo"
is.rooted.multiPhylo(tree_aarf)
##        astral  concatenated genepartition    bestscheme          atpA 
##         FALSE         FALSE         FALSE         FALSE         FALSE 
##          atpB          atpE          atpF          atpH          atpI 
##         FALSE         FALSE         FALSE         FALSE         FALSE 
##          cemA          matK          ndhA          ndhB          ndhC 
##         FALSE         FALSE         FALSE         FALSE         FALSE 
##          ndhD          ndhE          ndhF          ndhG          ndhI 
##         FALSE         FALSE         FALSE         FALSE         FALSE 
##          ndhJ          petB          petD          petG          petL 
##         FALSE         FALSE         FALSE         FALSE         FALSE 
##          petN          psaA          psaB          psaC          psaI 
##         FALSE         FALSE         FALSE         FALSE         FALSE 
##          psaJ          psbA          psbB          psbC          psbD 
##         FALSE         FALSE         FALSE         FALSE         FALSE 
##          psbE          psbF          psbH          psbI          psbJ 
##         FALSE         FALSE         FALSE         FALSE         FALSE 
##          psbK          psbM          psbN          psbT          psbZ 
##         FALSE         FALSE         FALSE         FALSE         FALSE 
##          rbcL         rpl14         rpl20         rpl23          rpl2 
##         FALSE         FALSE         FALSE         FALSE         FALSE 
##         rpl33         rpl36          rpoB         rpoC1         rpoC2 
##         FALSE         FALSE         FALSE         FALSE         FALSE 
##         rps11         rps12         rps14         rps15         rps18 
##         FALSE         FALSE         FALSE         FALSE         FALSE 
##          rps2          rps3          rps4          rps7          rps8 
##         FALSE         FALSE         FALSE         FALSE         FALSE 
##          ycf3 
##         FALSE
tree_names_aarf <- names(tree_aarf)
tree_names_aarf
##  [1] "astral"        "concatenated"  "genepartition" "bestscheme"   
##  [5] "atpA"          "atpB"          "atpE"          "atpF"         
##  [9] "atpH"          "atpI"          "cemA"          "matK"         
## [13] "ndhA"          "ndhB"          "ndhC"          "ndhD"         
## [17] "ndhE"          "ndhF"          "ndhG"          "ndhI"         
## [21] "ndhJ"          "petB"          "petD"          "petG"         
## [25] "petL"          "petN"          "psaA"          "psaB"         
## [29] "psaC"          "psaI"          "psaJ"          "psbA"         
## [33] "psbB"          "psbC"          "psbD"          "psbE"         
## [37] "psbF"          "psbH"          "psbI"          "psbJ"         
## [41] "psbK"          "psbM"          "psbN"          "psbT"         
## [45] "psbZ"          "rbcL"          "rpl14"         "rpl20"        
## [49] "rpl23"         "rpl2"          "rpl33"         "rpl36"        
## [53] "rpoB"          "rpoC1"         "rpoC2"         "rps11"        
## [57] "rps12"         "rps14"         "rps15"         "rps18"        
## [61] "rps2"          "rps3"          "rps4"          "rps7"         
## [65] "rps8"          "ycf3"
#This is to identify metadata
method_aarf <- read.csv("all_spgenetree_aa.csv", h=F)
#Calculating the treespace
treespace_aarf <- treespace(tree_aarf,  method = "RF", nf=65)
#head(treespace_aarf)
#head(treespace_aarf$pco$li)
plotGrovesD3(treespace_aarf$pc, treeNames = tree_names_aarf)
#Detecting the number of groups
#id_cluster_aarf <- findGroves(treespace_aarf, , method = RF)
#300
#y

id_cluster_aarf <- findGroves(treespace_aarf, , method = RF, nclust = 2)
plotGrovesD3(id_cluster_aarf)
#making a dataframe with the coordinates of the trees calculated by treespace(), cluster information, and rownmes 
#for names of trees; group is not necessary because I have already calculated the number of groups with findGroves()
df_aarf <- data.frame(id_cluster_aarf$treespace$pco$li, dataset=tree_names_aarf, Method=method_aarf$V1, group=id_cluster_aarf$groups, name=rownames(id_cluster_aarf$treespace$pco$li))
#head(df_aarf)
df_aarf["sptree_vs_getree"] <- c("as", "c", "gp", "bs", NA, NA, NA, NA, NA, NA, NA, "matK", NA, NA, NA, NA, NA, "ndhF", NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, "rbcL", NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA)
#head(df_aarf)
#change order to avoid overlapping
df2_aa_arrangedrf <- arrange(df_aarf, desc(Method))
#head(df2_aa_arrangedrf)
#We use the round() and sum() to verify the percentage explained by each axis in the ordination
round(treespace_aarf$pco$eig/sum(treespace_aarf$pco$eig),3)
##  [1] 0.168 0.029 0.027 0.026 0.025 0.024 0.024 0.024 0.023 0.023 0.023
## [12] 0.022 0.022 0.021 0.021 0.021 0.021 0.021 0.020 0.020 0.020 0.019
## [23] 0.019 0.018 0.018 0.017 0.016 0.016 0.016 0.015 0.014 0.014 0.013
## [34] 0.012 0.012 0.012 0.011 0.011 0.010 0.010 0.009 0.009 0.009 0.008
## [45] 0.008 0.007 0.006 0.006 0.006 0.005 0.005 0.005 0.004 0.003 0.003
## [56] 0.003 0.002 0.001 0.001 0.001 0.000 0.000 0.000 0.000
#Axis1 = 0.168, Axis2 = 0.029, Axis3=0.027, Axis4=0.026, Axis5=0.025
#Now we plot the graph
#the next line is not necessary here because we don't have svdq_aa
df2_aa_arrangedrf$Method <- factor(df2_aa_arrangedrf$Method, levels(df2_aa_arrangedrf$Method)[c(1,2,5,4,3)])
plot_RF_aa <- ggplot(df2_aa_arrangedrf, aes(x=A1, y=A2, color = Method, label = sptree_vs_getree)) +
  geom_point(aes(color=Method, shape=Method), alpha=.8, size=4) + 
  guides(size=F) + geom_hline(yintercept = 0, linetype="dashed") +
  geom_vline(xintercept = 0, linetype="dashed") +
  geom_text_repel(size=6, box.padding=2, point.padding = 0.8, show.legend=FALSE) +
  labs(x="PCoA1 (16.8%)", y=" PCoA2 (2.9%)") + 
  theme_classic() + ggtitle("Robinson-Foulds - amino acid") +
  theme(legend.text=element_text(size=12),legend.title=element_text(size=15), plot.title = element_text(hjust = 0.5)) +
  guides(colour = guide_legend(override.aes = list(size=5))) +
  theme(axis.title=element_text(size = 12,face="bold"), axis.text=element_text(size=10,face="bold"))  +
  scale_colour_manual(values = c("#000000", "#CC79A7", "#009E73", "#F0E442", "#56B4E9"))+
  scale_shape_manual(values=c(17,19, 19, 19, 19))
plot_RF_aa
## Warning: Removed 59 rows containing missing values (geom_text_repel).

2.2.2. KC (amino acid matrix)

#KC (amino acid matrix)
tree_aa <- read.nexus("./input_rooted/all_spgenetree_aa_kc.nex")
tree_aa
## 66 phylogenetic trees
tree_aa <- root(tree_aa, "Apiales_Araliaceae_Aralia_undulata", resolve.root = TRUE)
is.rooted.multiPhylo(tree_aa)
##        astral  concatenated genepartition    bestscheme          atpA 
##          TRUE          TRUE          TRUE          TRUE          TRUE 
##          atpB          atpE          atpF          atpH          atpI 
##          TRUE          TRUE          TRUE          TRUE          TRUE 
##          cemA          matK          ndhA          ndhB          ndhC 
##          TRUE          TRUE          TRUE          TRUE          TRUE 
##          ndhD          ndhE          ndhF          ndhG          ndhI 
##          TRUE          TRUE          TRUE          TRUE          TRUE 
##          ndhJ          petB          petD          petG          petL 
##          TRUE          TRUE          TRUE          TRUE          TRUE 
##          petN          psaA          psaB          psaC          psaI 
##          TRUE          TRUE          TRUE          TRUE          TRUE 
##          psaJ          psbA          psbB          psbC          psbD 
##          TRUE          TRUE          TRUE          TRUE          TRUE 
##          psbE          psbF          psbH          psbI          psbJ 
##          TRUE          TRUE          TRUE          TRUE          TRUE 
##          psbK          psbM          psbN          psbT          psbZ 
##          TRUE          TRUE          TRUE          TRUE          TRUE 
##          rbcL         rpl14         rpl20         rpl23          rpl2 
##          TRUE          TRUE          TRUE          TRUE          TRUE 
##         rpl33         rpl36          rpoB         rpoC1         rpoC2 
##          TRUE          TRUE          TRUE          TRUE          TRUE 
##         rps11         rps12         rps14         rps15         rps18 
##          TRUE          TRUE          TRUE          TRUE          TRUE 
##          rps2          rps3          rps4          rps7          rps8 
##          TRUE          TRUE          TRUE          TRUE          TRUE 
##          ycf3 
##          TRUE
class(tree_aa)
## [1] "multiPhylo"
tree_names_aa <- names(tree_aa)
tree_names_aa
##  [1] "astral"        "concatenated"  "genepartition" "bestscheme"   
##  [5] "atpA"          "atpB"          "atpE"          "atpF"         
##  [9] "atpH"          "atpI"          "cemA"          "matK"         
## [13] "ndhA"          "ndhB"          "ndhC"          "ndhD"         
## [17] "ndhE"          "ndhF"          "ndhG"          "ndhI"         
## [21] "ndhJ"          "petB"          "petD"          "petG"         
## [25] "petL"          "petN"          "psaA"          "psaB"         
## [29] "psaC"          "psaI"          "psaJ"          "psbA"         
## [33] "psbB"          "psbC"          "psbD"          "psbE"         
## [37] "psbF"          "psbH"          "psbI"          "psbJ"         
## [41] "psbK"          "psbM"          "psbN"          "psbT"         
## [45] "psbZ"          "rbcL"          "rpl14"         "rpl20"        
## [49] "rpl23"         "rpl2"          "rpl33"         "rpl36"        
## [53] "rpoB"          "rpoC1"         "rpoC2"         "rps11"        
## [57] "rps12"         "rps14"         "rps15"         "rps18"        
## [61] "rps2"          "rps3"          "rps4"          "rps7"         
## [65] "rps8"          "ycf3"
#This is to identify metadata
method_aa <- read.csv("all_spgenetree_aa.csv", h=F)
#Calculating the treespace
treespace_aa <- treespace(tree_aa, nf=65)
#head(treespace_aa)
#head(treespace_aa$pco$li)
plotGrovesD3(treespace_aa$pc, treeNames = tree_names_aa)
#Detecting the number of groups
#id_cluster_aa <- findGroves(treespace_aa)
#1000
#y

id_cluster_aa <- findGroves(treespace_aa, nclust = 3)
plotGrovesD3(id_cluster_aa, treeNames = tree_names_aa)
#making a dataframe with the coordinates of the trees calculated by treespace(), cluster information, and rownmes 
#for names of trees; group is not necessary because I have already calculated the number of groups with findGroves()
df_aa <- data.frame(id_cluster_aa$treespace$pco$li, dataset=tree_names_aa, Method=method_aa$V1,
                    group=id_cluster_aa$groups, name=rownames(id_cluster_aa$treespace$pco$li))
#head(df_aa)
df_aa["sptree_vs_getree"] <- c("as", "c", "gp", "bs", NA, NA, NA, NA, NA, NA, NA, "matK", NA, NA, NA, NA, NA, "ndhF", NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, "rbcL", NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA)
#head(df_aa)
#change order to avoid overlapping
df2_aa_arranged <- arrange(df_aa, desc(Method))
#head(df2_aa_arranged)
#We use the round() and sum() to verify the percentage explained by each axis in the ordination
round(treespace_aa$pco$eig/sum(treespace_aa$pco$eig),3)
##  [1] 0.264 0.106 0.083 0.061 0.042 0.040 0.038 0.032 0.031 0.026 0.024
## [12] 0.021 0.020 0.017 0.016 0.016 0.014 0.013 0.011 0.010 0.010 0.009
## [23] 0.008 0.007 0.007 0.006 0.006 0.005 0.005 0.004 0.004 0.004 0.004
## [34] 0.003 0.003 0.003 0.003 0.002 0.002 0.002 0.002 0.002 0.001 0.001
## [45] 0.001 0.001 0.001 0.001 0.001 0.001 0.001 0.001 0.001 0.000 0.000
## [56] 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000
#Axis1 = 0.264; Axis2 = 0.106, Axis3 = 0.083, Axis4 = 0.061, Axis5 = 0.042
df2_aa_arranged$Method <- factor(df2_aa_arranged$Method, levels(df2_aa_arranged$Method)[c(1,2,5,4,3)])
#Now we plot the graph
plot_KC_aa <- ggplot(df2_aa_arranged, aes(x=A1, y=A2, color = Method, label = sptree_vs_getree)) +
  geom_point(aes(color=Method, shape=Method), alpha=.8, size=4) + 
  guides(size=F) + geom_hline(yintercept = 0, linetype="dashed") +
  geom_vline(xintercept = 0, linetype="dashed") +
  geom_text_repel(size=6, box.padding=2, point.padding = 0.8, show.legend=FALSE) +
  labs(x="PCoA1 (26.4%)", y=" PCoA2 (10.6%)") + 
  theme_classic() + ggtitle("Kendall-Colijn - amino acid") +
  theme(legend.text=element_text(size=12),legend.title=element_text(size=15), plot.title = element_text(hjust = 0.5)) +
  guides(colour = guide_legend(override.aes = list(size=5))) +
  theme(axis.title=element_text(size = 12,face="bold"), axis.text=element_text(size=10,face="bold"))  +
  scale_colour_manual(values = c("#000000", "#CC79A7", "#009E73", "#F0E442", "#56B4E9"))+
  scale_shape_manual(values=c(17, 19, 19, 19, 19))
plot_KC_aa
## Warning: Removed 59 rows containing missing values (geom_text_repel).

2.3. Exon matrix

2.3.1. RF (exon matrix)

tree_exrf <- read.nexus("./input_unrooted/all_spgenetree_ex_rf.nex")
is.rooted.multiPhylo(tree_exrf)
##        astral          svdq  concatenated genepartition    bestscheme 
##         FALSE         FALSE         FALSE         FALSE         FALSE 
##    iq_ex_atpA    iq_ex_atpB    iq_ex_atpE    iq_ex_atpF    iq_ex_atpH 
##         FALSE         FALSE         FALSE         FALSE         FALSE 
##    iq_ex_atpI    iq_ex_cemA    iq_ex_matK    iq_ex_ndhA    iq_ex_ndhB 
##         FALSE         FALSE         FALSE         FALSE         FALSE 
##    iq_ex_ndhC    iq_ex_ndhD    iq_ex_ndhE    iq_ex_ndhF    iq_ex_ndhG 
##         FALSE         FALSE         FALSE         FALSE         FALSE 
##    iq_ex_ndhI    iq_ex_ndhJ    iq_ex_petB    iq_ex_petD    iq_ex_petG 
##         FALSE         FALSE         FALSE         FALSE         FALSE 
##    iq_ex_petL    iq_ex_petN    iq_ex_psaA    iq_ex_psaB    iq_ex_psaC 
##         FALSE         FALSE         FALSE         FALSE         FALSE 
##    iq_ex_psaI    iq_ex_psaJ    iq_ex_psbA    iq_ex_psbB    iq_ex_psbC 
##         FALSE         FALSE         FALSE         FALSE         FALSE 
##    iq_ex_psbD    iq_ex_psbE    iq_ex_psbF    iq_ex_psbH    iq_ex_psbI 
##         FALSE         FALSE         FALSE         FALSE         FALSE 
##    iq_ex_psbJ    iq_ex_psbK    iq_ex_psbM    iq_ex_psbN    iq_ex_psbT 
##         FALSE         FALSE         FALSE         FALSE         FALSE 
##    iq_ex_psbZ    iq_ex_rbcL   iq_ex_rpl14   iq_ex_rpl20   iq_ex_rpl23 
##         FALSE         FALSE         FALSE         FALSE         FALSE 
##    iq_ex_rpl2   iq_ex_rpl33   iq_ex_rpl36    iq_ex_rpoB   iq_ex_rpoC1 
##         FALSE         FALSE         FALSE         FALSE         FALSE 
##   iq_ex_rpoC2   iq_ex_rps11   iq_ex_rps12   iq_ex_rps14   iq_ex_rps15 
##         FALSE         FALSE         FALSE         FALSE         FALSE 
##   iq_ex_rps18    iq_ex_rps2    iq_ex_rps3    iq_ex_rps4    iq_ex_rps7 
##         FALSE         FALSE         FALSE         FALSE         FALSE 
##    iq_ex_rps8    iq_ex_ycf3 
##         FALSE         FALSE
tree_names_exrf <- names(tree_exrf)
tree_names_exrf
##  [1] "astral"        "svdq"          "concatenated"  "genepartition"
##  [5] "bestscheme"    "iq_ex_atpA"    "iq_ex_atpB"    "iq_ex_atpE"   
##  [9] "iq_ex_atpF"    "iq_ex_atpH"    "iq_ex_atpI"    "iq_ex_cemA"   
## [13] "iq_ex_matK"    "iq_ex_ndhA"    "iq_ex_ndhB"    "iq_ex_ndhC"   
## [17] "iq_ex_ndhD"    "iq_ex_ndhE"    "iq_ex_ndhF"    "iq_ex_ndhG"   
## [21] "iq_ex_ndhI"    "iq_ex_ndhJ"    "iq_ex_petB"    "iq_ex_petD"   
## [25] "iq_ex_petG"    "iq_ex_petL"    "iq_ex_petN"    "iq_ex_psaA"   
## [29] "iq_ex_psaB"    "iq_ex_psaC"    "iq_ex_psaI"    "iq_ex_psaJ"   
## [33] "iq_ex_psbA"    "iq_ex_psbB"    "iq_ex_psbC"    "iq_ex_psbD"   
## [37] "iq_ex_psbE"    "iq_ex_psbF"    "iq_ex_psbH"    "iq_ex_psbI"   
## [41] "iq_ex_psbJ"    "iq_ex_psbK"    "iq_ex_psbM"    "iq_ex_psbN"   
## [45] "iq_ex_psbT"    "iq_ex_psbZ"    "iq_ex_rbcL"    "iq_ex_rpl14"  
## [49] "iq_ex_rpl20"   "iq_ex_rpl23"   "iq_ex_rpl2"    "iq_ex_rpl33"  
## [53] "iq_ex_rpl36"   "iq_ex_rpoB"    "iq_ex_rpoC1"   "iq_ex_rpoC2"  
## [57] "iq_ex_rps11"   "iq_ex_rps12"   "iq_ex_rps14"   "iq_ex_rps15"  
## [61] "iq_ex_rps18"   "iq_ex_rps2"    "iq_ex_rps3"    "iq_ex_rps4"   
## [65] "iq_ex_rps7"    "iq_ex_rps8"    "iq_ex_ycf3"
#This is to identify metadata
method_exrf <- read.csv("all_spgenetree.csv", h=F)
class(method_exrf)
## [1] "data.frame"
#Calculating the treespace
treespace_exrf <- treespace(tree_exrf,  method = "RF", nf=66)
## Warning in is.euclid(distmat): Zero distance(s)
#head(treespace_exrf)
#head(treespace_exrf$pco$li)
plotGrovesD3(treespace_exrf$pc, treeNames = tree_names_exrf)
#Detecting the number of groups
#id_cluster_exrf <- findGroves(treespace_exrf, method = "RF")
#250
#y

id_cluster_exrf <- findGroves(treespace_exrf, method = "RF", nclust = 2)
plotGrovesD3(id_cluster_exrf, treeNames = tree_names_exrf)
#making a dataframe with the coordinates of the trees calculated by treespace(), cluster information, and rownmes 
#for names of trees; group is not necessary because I have already calculated the number of groups with findGroves()
df_exrf <- data.frame(id_cluster_exrf$treespace$pco$li, dataset=tree_names_exrf, Method=method_exrf$V1,
                      group=id_cluster_exrf$groups, name=rownames(id_cluster_exrf$treespace$pco$li))
#head(df_exrf)
df_exrf["sptree_vs_getree"] <- c("as", "sv", "c", "gp", "bs", NA, NA, NA, NA, NA, NA, NA, "matK", NA, NA, NA, NA, NA, "ndhF", NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, "rbcL", NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA)
#head(df_exrf)
#change order to avoid overlapping
df2_ex_arrangedrf <- arrange(df_exrf, desc(Method))
#head(df2_ex_arrangedrf)
#We use the round() and sum() to verify the percentage explained by each axis in the ordination
round(treespace_exrf$pco$eig/sum(treespace_exrf$pco$eig),3)
##  [1] 0.147 0.050 0.041 0.040 0.039 0.037 0.036 0.035 0.031 0.031 0.030
## [12] 0.027 0.026 0.026 0.025 0.023 0.023 0.023 0.020 0.020 0.018 0.017
## [23] 0.017 0.016 0.014 0.014 0.013 0.011 0.011 0.010 0.009 0.009 0.009
## [34] 0.008 0.008 0.007 0.007 0.007 0.006 0.005 0.005 0.005 0.005 0.004
## [45] 0.004 0.004 0.003 0.003 0.003 0.002 0.002 0.002 0.002 0.002 0.002
## [56] 0.001 0.001 0.001 0.001 0.001 0.001 0.001 0.000 0.000
#Axis1 = 0.147; Axis2 = 0.050, Axis3 = 0.041, Axis4 = 0.040, Axis5 = 0.039
#Now we plot the graph
df2_ex_arrangedrf$Method <- factor(df2_ex_arrangedrf$Method, levels(df2_ex_arrangedrf$Method)[c(1,4,2,5,6,3)])

plot_RF_ex <- ggplot(df2_ex_arrangedrf, aes(x=A1, y=A2, color = Method, label = sptree_vs_getree)) +
  geom_point(aes(color=Method, shape=Method), alpha=.6, size=4) + 
  guides(size=F) + geom_hline(yintercept = 0, linetype="dashed") +
  geom_vline(xintercept = 0, linetype="dashed") +
  geom_text_repel(size=6, box.padding=2, point.padding = 0.8, show.legend=FALSE) +
  labs(x="PCoA1 (14.7%)", y=" PCoA2 (5.0%)") + 
  theme_classic() + ggtitle("Robinson-Foulds - exon") +
  theme(legend.text=element_text(size=12),legend.title=element_text(size=15), plot.title = element_text(hjust = 0.5)) +
  guides(colour = guide_legend(override.aes = list(size=5))) +
  theme(axis.title=element_text(size = 12,face="bold"), axis.text=element_text(size=10,face="bold")) +
  scale_colour_manual(values = c("#000000", "#E69F00", "#CC79A7", "#009E73", "#F0E442", "#56B4E9"))+
  scale_shape_manual(values=c(17, 17, 19, 19, 19, 19))
plot_RF_ex
## Warning: Removed 59 rows containing missing values (geom_text_repel).

2.3.2. KC (exon matrix)

#KC (Exon matrix)
tree_ex <- read.nexus("input_rooted/all_spgenetree_ex_kc.nex")
is.rooted.multiPhylo(tree_ex)
##        astral          svdq  concatenated genepartition    bestscheme 
##          TRUE         FALSE         FALSE         FALSE         FALSE 
##          atpA          atpB          atpE          atpF          atpH 
##          TRUE          TRUE          TRUE          TRUE          TRUE 
##          atpI          cemA          matK          ndhA          ndhB 
##          TRUE          TRUE          TRUE          TRUE          TRUE 
##          ndhC          ndhD          ndhE          ndhF          ndhG 
##          TRUE          TRUE          TRUE          TRUE          TRUE 
##          ndhI          ndhJ          petB          petD          petG 
##          TRUE          TRUE          TRUE          TRUE          TRUE 
##          petL          petN          psaA          psaB          psaC 
##          TRUE          TRUE          TRUE          TRUE          TRUE 
##          psaI          psaJ          psbA          psbB          psbC 
##          TRUE          TRUE          TRUE          TRUE          TRUE 
##          psbD          psbE          psbF          psbH          psbI 
##          TRUE          TRUE          TRUE          TRUE          TRUE 
##          psbJ          psbK          psbM          psbN          psbT 
##          TRUE          TRUE          TRUE          TRUE          TRUE 
##          psbZ          rbcL         rpl14         rpl20         rpl23 
##          TRUE          TRUE          TRUE          TRUE          TRUE 
##          rpl2         rpl33         rpl36          rpoB         rpoC1 
##          TRUE          TRUE          TRUE          TRUE          TRUE 
##         rpoC2         rps11         rps12         rps14         rps15 
##          TRUE          TRUE          TRUE          TRUE          TRUE 
##         rps18          rps2          rps3          rps4          rps7 
##          TRUE          TRUE          TRUE          TRUE          TRUE 
##          rps8          ycf3 
##          TRUE          TRUE
tree_ex <- root(tree_ex, "Apiales_Araliaceae_Aralia_undulata", resolve.root = TRUE)
is.rooted.multiPhylo(tree_ex)
##        astral          svdq  concatenated genepartition    bestscheme 
##          TRUE          TRUE          TRUE          TRUE          TRUE 
##          atpA          atpB          atpE          atpF          atpH 
##          TRUE          TRUE          TRUE          TRUE          TRUE 
##          atpI          cemA          matK          ndhA          ndhB 
##          TRUE          TRUE          TRUE          TRUE          TRUE 
##          ndhC          ndhD          ndhE          ndhF          ndhG 
##          TRUE          TRUE          TRUE          TRUE          TRUE 
##          ndhI          ndhJ          petB          petD          petG 
##          TRUE          TRUE          TRUE          TRUE          TRUE 
##          petL          petN          psaA          psaB          psaC 
##          TRUE          TRUE          TRUE          TRUE          TRUE 
##          psaI          psaJ          psbA          psbB          psbC 
##          TRUE          TRUE          TRUE          TRUE          TRUE 
##          psbD          psbE          psbF          psbH          psbI 
##          TRUE          TRUE          TRUE          TRUE          TRUE 
##          psbJ          psbK          psbM          psbN          psbT 
##          TRUE          TRUE          TRUE          TRUE          TRUE 
##          psbZ          rbcL         rpl14         rpl20         rpl23 
##          TRUE          TRUE          TRUE          TRUE          TRUE 
##          rpl2         rpl33         rpl36          rpoB         rpoC1 
##          TRUE          TRUE          TRUE          TRUE          TRUE 
##         rpoC2         rps11         rps12         rps14         rps15 
##          TRUE          TRUE          TRUE          TRUE          TRUE 
##         rps18          rps2          rps3          rps4          rps7 
##          TRUE          TRUE          TRUE          TRUE          TRUE 
##          rps8          ycf3 
##          TRUE          TRUE
tree_names_ex <- names(tree_ex)
tree_names_ex
##  [1] "astral"        "svdq"          "concatenated"  "genepartition"
##  [5] "bestscheme"    "atpA"          "atpB"          "atpE"         
##  [9] "atpF"          "atpH"          "atpI"          "cemA"         
## [13] "matK"          "ndhA"          "ndhB"          "ndhC"         
## [17] "ndhD"          "ndhE"          "ndhF"          "ndhG"         
## [21] "ndhI"          "ndhJ"          "petB"          "petD"         
## [25] "petG"          "petL"          "petN"          "psaA"         
## [29] "psaB"          "psaC"          "psaI"          "psaJ"         
## [33] "psbA"          "psbB"          "psbC"          "psbD"         
## [37] "psbE"          "psbF"          "psbH"          "psbI"         
## [41] "psbJ"          "psbK"          "psbM"          "psbN"         
## [45] "psbT"          "psbZ"          "rbcL"          "rpl14"        
## [49] "rpl20"         "rpl23"         "rpl2"          "rpl33"        
## [53] "rpl36"         "rpoB"          "rpoC1"         "rpoC2"        
## [57] "rps11"         "rps12"         "rps14"         "rps15"        
## [61] "rps18"         "rps2"          "rps3"          "rps4"         
## [65] "rps7"          "rps8"          "ycf3"
#This is to identify metadata
method_ex <- read.csv("all_spgenetree.csv", h=F)
#Calculating the treespace
treespace_ex <- treespace(tree_ex, nf=66)
#head(treespace_ex$pco$li)
plotGrovesD3(treespace_ex$pc, treeNames = tree_names_ex)
#Detecting the number of groups
#id_cluster_ex <- findGroves(treespace_ex)
#580
#y

id_cluster_ex <- findGroves(treespace_ex, nclust = 3)
plotGrovesD3(id_cluster_ex, treeNames = tree_names_ex)
#making a dataframe with the coordinates of the trees calculated by treespace(), cluster information, and rownmes 
#for names of trees; group is not necessary because I have already calculated the number of groups with findGroves()
df_ex <- data.frame(id_cluster_ex$treespace$pco$li, dataset=tree_names_ex, Method=method_ex$V1, 
                    group=id_cluster_ex$groups, name=rownames(id_cluster_ex$treespace$pco$li))
#head(df_ex)
df_ex["sptree_vs_getree"] <- c("as", "sv", "c", "gp", "bs", NA, NA, NA, NA, NA, NA, NA, "matK", NA, NA, NA, NA, NA, "ndhF", NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, "rbcL", NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA)
#head(df_ex)
#change order to avoid overlapping
df2_ex_arranged <- arrange(df_ex, desc(Method))
#head(df2_ex_arranged)
#We use the round() and sum() to verify the percentage explained by each axis in the ordination
round(treespace_ex$pco$eig/sum(treespace_ex$pco$eig),3)
##  [1] 0.521 0.086 0.061 0.047 0.031 0.027 0.022 0.021 0.019 0.017 0.015
## [12] 0.013 0.012 0.010 0.010 0.009 0.008 0.007 0.006 0.006 0.005 0.004
## [23] 0.004 0.004 0.003 0.003 0.003 0.003 0.002 0.002 0.002 0.002 0.002
## [34] 0.002 0.001 0.001 0.001 0.001 0.001 0.001 0.001 0.001 0.001 0.001
## [45] 0.001 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000
## [56] 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000
#Axis1 = 0.521 Axis2 = 0.086, Axis3 = 0.061, Axis4 = 0.047, Axis5 = 0.031
#Now we plot the graph
df2_ex_arranged$Method <- factor(df2_ex_arranged$Method, levels(df2_ex_arranged$Method)[c(1,3,6,2,4,5)])

plot_KC_ex <- ggplot(df2_ex_arranged, aes(x=A1, y=A2, color = Method, label = sptree_vs_getree)) +
  geom_point(aes(color=Method, shape=Method), alpha=.6, size=4) + 
  guides(size=F) + geom_hline(yintercept = 0, linetype="dashed") +
  geom_vline(xintercept = 0, linetype="dashed") +
  geom_text_repel(size=6, box.padding=2, point.padding=0.8, show.legend=FALSE) +
  labs(x="PCoA1 (52.1%)", y=" PCoA2 (8.6%)") + 
  theme_classic() + ggtitle("Kendall-Colijn - exon") +
  theme(legend.text=element_text(size=12),legend.title=element_text(size=15), plot.title = element_text(hjust = 0.5)) +
  guides(colour = guide_legend(override.aes = list(size=5))) +
  theme(axis.title=element_text(size = 12,face="bold"), axis.text=element_text(size=10,face="bold"))+
  scale_colour_manual(values = c("#000000", "#E69F00", "#CC79A7", "#009E73", "#F0E442", "#56B4E9"))+
  scale_shape_manual(values=c(17, 17, 19, 19, 19, 19))
plot_KC_ex
## Warning: Removed 59 rows containing missing values (geom_text_repel).

2.4. Codon matrix

2.4.1. RF (codon matrix)

#RF (codon matrix)
tree_cdrf <- read.nexus("./input_unrooted/all_spgenetree_cd_rf.nex")
is.rooted.multiPhylo(tree_cdrf)
##        astral          svdq  concatenated genepartition    bestscheme 
##         FALSE         FALSE         FALSE         FALSE         FALSE 
##          atpA          atpB          atpE          atpF          atpH 
##         FALSE         FALSE         FALSE         FALSE         FALSE 
##          atpI          cemA          matK          ndhA          ndhB 
##         FALSE         FALSE         FALSE         FALSE         FALSE 
##          ndhC          ndhD          ndhE          ndhF          ndhG 
##         FALSE         FALSE         FALSE         FALSE         FALSE 
##          ndhI          ndhJ          petB          petD          petG 
##         FALSE         FALSE         FALSE         FALSE         FALSE 
##          petL          petN          psaA          psaB          psaC 
##         FALSE         FALSE         FALSE         FALSE         FALSE 
##          psaI          psaJ          psbA          psbB          psbC 
##         FALSE         FALSE         FALSE         FALSE         FALSE 
##          psbD          psbE          psbF          psbH          psbI 
##         FALSE         FALSE         FALSE         FALSE         FALSE 
##          psbJ          psbK          psbM          psbN          psbT 
##         FALSE         FALSE         FALSE         FALSE         FALSE 
##          psbZ          rbcL         rpl14         rpl20         rpl23 
##         FALSE         FALSE         FALSE         FALSE         FALSE 
##          rpl2         rpl33         rpl36          rpoB         rpoC1 
##         FALSE         FALSE         FALSE         FALSE         FALSE 
##         rpoC2         rps11         rps12         rps14         rps15 
##         FALSE         FALSE         FALSE         FALSE         FALSE 
##         rps18          rps2          rps3          rps4          rps7 
##         FALSE         FALSE         FALSE         FALSE         FALSE 
##          rps8          ycf3 
##         FALSE         FALSE
#tree_ex <- unroot(tree_cdrf)
tree_names_cdrf <- names(tree_cdrf)
tree_names_cdrf
##  [1] "astral"        "svdq"          "concatenated"  "genepartition"
##  [5] "bestscheme"    "atpA"          "atpB"          "atpE"         
##  [9] "atpF"          "atpH"          "atpI"          "cemA"         
## [13] "matK"          "ndhA"          "ndhB"          "ndhC"         
## [17] "ndhD"          "ndhE"          "ndhF"          "ndhG"         
## [21] "ndhI"          "ndhJ"          "petB"          "petD"         
## [25] "petG"          "petL"          "petN"          "psaA"         
## [29] "psaB"          "psaC"          "psaI"          "psaJ"         
## [33] "psbA"          "psbB"          "psbC"          "psbD"         
## [37] "psbE"          "psbF"          "psbH"          "psbI"         
## [41] "psbJ"          "psbK"          "psbM"          "psbN"         
## [45] "psbT"          "psbZ"          "rbcL"          "rpl14"        
## [49] "rpl20"         "rpl23"         "rpl2"          "rpl33"        
## [53] "rpl36"         "rpoB"          "rpoC1"         "rpoC2"        
## [57] "rps11"         "rps12"         "rps14"         "rps15"        
## [61] "rps18"         "rps2"          "rps3"          "rps4"         
## [65] "rps7"          "rps8"          "ycf3"
#This is to identify metadata
method_cdrf <- read.csv("all_spgenetree.csv", h=F)
#Calculating the treespace
treespace_cdrf <- treespace(tree_cdrf,  method = "RF", nf=66)
## Warning in is.euclid(distmat): Zero distance(s)
#head(treespace_cdrf)
#head(treespace_cdrf$pco$li)
plotGrovesD3(treespace_cdrf$pc, treeNames = tree_names_cdrf)
#Detecting the number of groups
#id_cluster_cdrf <- findGroves(treespace_cdrf, method = "RF")
#200
#y

id_cluster_cdrf <- findGroves(treespace_cdrf, method = "RF", nclust = 2)
plotGrovesD3(id_cluster_cdrf, treeNames = tree_names_cdrf)
#making a dataframe with the coordinates of the trees calculated by treespace(), cluster information, and rownmes 
#for names of trees; group is not necessary because I have already calculated the number of groups with findGroves()
df_cdrf <- data.frame(id_cluster_cdrf$treespace$pco$li, dataset=tree_names_cdrf, Method=method_cdrf$V1,
                      group=id_cluster_cdrf$groups, name=rownames(id_cluster_cdrf$treespace$pco$li))
#head(df_cdrf)
df_cdrf["sptree_vs_getree"] <- c("as", "sv", "c", "gp", "bs", NA, NA, NA, NA, NA, NA, NA, "matK", NA, NA, NA, NA, NA, "ndhF", NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, "rbcL", NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA)
#head(df_cdrf)
#change order to avoid overlapping
df2_cd_arrangedrf <- arrange(df_cdrf, desc(Method))
#head(df2_cd_arrangedrf)
#We use the round() and sum() to verify the percentage explained by each axis in the ordination
round(treespace_cdrf$pco$eig/sum(treespace_cdrf$pco$eig),3)
##  [1] 0.147 0.050 0.041 0.040 0.039 0.037 0.036 0.035 0.032 0.031 0.030
## [12] 0.027 0.026 0.026 0.024 0.023 0.023 0.023 0.020 0.020 0.018 0.017
## [23] 0.017 0.016 0.014 0.014 0.013 0.011 0.011 0.010 0.009 0.009 0.009
## [34] 0.008 0.008 0.008 0.007 0.007 0.006 0.005 0.005 0.005 0.005 0.004
## [45] 0.004 0.003 0.003 0.003 0.003 0.002 0.002 0.002 0.002 0.002 0.002
## [56] 0.001 0.001 0.001 0.001 0.001 0.001 0.000 0.000 0.000 0.000
#Axis1 = 0.147, Axis2 = 0.050, Axis3 = 0.041, Axis4 = 0.040, Axis5 = 0.039
#Now we plot the graph
df2_cd_arrangedrf$Method <- factor(df2_cd_arrangedrf$Method, levels(df2_cd_arrangedrf$Method)[c(1,2,3,6,5,4)])

plot_RF_cd <- ggplot(df2_cd_arrangedrf, aes(x=A1, y=A2, color = Method, label = sptree_vs_getree)) +
  geom_point(aes(color=Method, shape=Method), alpha=.6, size=4) + 
  guides(size=F) + geom_hline(yintercept = 0, linetype="dashed") +
  geom_vline(xintercept = 0, linetype="dashed") +
  geom_text_repel(size=6, box.padding=2, point.padding = 0.8, show.legend=FALSE) +
  labs(x="PCoA1 (14.7%)", y=" PCoA2 (0.05%)") + 
  theme_classic() + ggtitle("Robinson-Foulds - codon") +
  theme(legend.text=element_text(size=12),legend.title=element_text(size=15), plot.title = element_text(hjust = 0.5)) +
  guides(colour = guide_legend(override.aes = list(size=5))) +
  theme(axis.title=element_text(size = 12,face="bold"), axis.text=element_text(size=10,face="bold")) +
  scale_colour_manual(values = c("#000000", "#E69F00", "#CC79A7", "#009E73", "#F0E442", "#56B4E9"))+
  scale_shape_manual(values=c(17, 17, 19, 19, 19, 19))
plot_RF_cd
## Warning: Removed 59 rows containing missing values (geom_text_repel).

2.4.2. KC (codon matrix)

#KC (codon matrix)
tree_cd <- read.nexus("./input_rooted/all_spgenetree_cd_kc.nex")
tree_cd
## 67 phylogenetic trees
is.rooted.multiPhylo(tree_cd)
##        astral          svdq  concatenated genepartition    bestscheme 
##          TRUE         FALSE         FALSE         FALSE         FALSE 
##          atpA          atpB          atpE          atpF          atpH 
##          TRUE          TRUE          TRUE          TRUE          TRUE 
##          atpI          cemA          matK          ndhA          ndhB 
##          TRUE          TRUE          TRUE          TRUE          TRUE 
##          ndhC          ndhD          ndhE          ndhF          ndhG 
##          TRUE          TRUE          TRUE          TRUE          TRUE 
##          ndhI          ndhJ          petB          petD          petG 
##          TRUE          TRUE          TRUE          TRUE          TRUE 
##          petL          petN          psaA          psaB          psaC 
##          TRUE          TRUE          TRUE          TRUE          TRUE 
##          psaI          psaJ          psbA          psbB          psbC 
##          TRUE          TRUE          TRUE          TRUE          TRUE 
##          psbD          psbE          psbF          psbH          psbI 
##          TRUE          TRUE          TRUE          TRUE          TRUE 
##          psbJ          psbK          psbM          psbN          psbT 
##          TRUE          TRUE          TRUE          TRUE          TRUE 
##          psbZ          rbcL         rpl14         rpl20         rpl23 
##          TRUE          TRUE          TRUE          TRUE          TRUE 
##          rpl2         rpl33         rpl36          rpoB         rpoC1 
##          TRUE          TRUE          TRUE          TRUE          TRUE 
##         rpoC2         rps11         rps12         rps14         rps15 
##          TRUE          TRUE          TRUE          TRUE          TRUE 
##         rps18          rps2          rps3          rps4          rps7 
##          TRUE          TRUE          TRUE          TRUE          TRUE 
##          rps8          ycf3 
##          TRUE          TRUE
tree_cd <- root(tree_cd, "Apiales_Araliaceae_Aralia_undulata", resolve.root = TRUE)
is.rooted.multiPhylo(tree_cd)
##        astral          svdq  concatenated genepartition    bestscheme 
##          TRUE          TRUE          TRUE          TRUE          TRUE 
##          atpA          atpB          atpE          atpF          atpH 
##          TRUE          TRUE          TRUE          TRUE          TRUE 
##          atpI          cemA          matK          ndhA          ndhB 
##          TRUE          TRUE          TRUE          TRUE          TRUE 
##          ndhC          ndhD          ndhE          ndhF          ndhG 
##          TRUE          TRUE          TRUE          TRUE          TRUE 
##          ndhI          ndhJ          petB          petD          petG 
##          TRUE          TRUE          TRUE          TRUE          TRUE 
##          petL          petN          psaA          psaB          psaC 
##          TRUE          TRUE          TRUE          TRUE          TRUE 
##          psaI          psaJ          psbA          psbB          psbC 
##          TRUE          TRUE          TRUE          TRUE          TRUE 
##          psbD          psbE          psbF          psbH          psbI 
##          TRUE          TRUE          TRUE          TRUE          TRUE 
##          psbJ          psbK          psbM          psbN          psbT 
##          TRUE          TRUE          TRUE          TRUE          TRUE 
##          psbZ          rbcL         rpl14         rpl20         rpl23 
##          TRUE          TRUE          TRUE          TRUE          TRUE 
##          rpl2         rpl33         rpl36          rpoB         rpoC1 
##          TRUE          TRUE          TRUE          TRUE          TRUE 
##         rpoC2         rps11         rps12         rps14         rps15 
##          TRUE          TRUE          TRUE          TRUE          TRUE 
##         rps18          rps2          rps3          rps4          rps7 
##          TRUE          TRUE          TRUE          TRUE          TRUE 
##          rps8          ycf3 
##          TRUE          TRUE
tree_names_cd <- names(tree_cd)
tree_names_cd
##  [1] "astral"        "svdq"          "concatenated"  "genepartition"
##  [5] "bestscheme"    "atpA"          "atpB"          "atpE"         
##  [9] "atpF"          "atpH"          "atpI"          "cemA"         
## [13] "matK"          "ndhA"          "ndhB"          "ndhC"         
## [17] "ndhD"          "ndhE"          "ndhF"          "ndhG"         
## [21] "ndhI"          "ndhJ"          "petB"          "petD"         
## [25] "petG"          "petL"          "petN"          "psaA"         
## [29] "psaB"          "psaC"          "psaI"          "psaJ"         
## [33] "psbA"          "psbB"          "psbC"          "psbD"         
## [37] "psbE"          "psbF"          "psbH"          "psbI"         
## [41] "psbJ"          "psbK"          "psbM"          "psbN"         
## [45] "psbT"          "psbZ"          "rbcL"          "rpl14"        
## [49] "rpl20"         "rpl23"         "rpl2"          "rpl33"        
## [53] "rpl36"         "rpoB"          "rpoC1"         "rpoC2"        
## [57] "rps11"         "rps12"         "rps14"         "rps15"        
## [61] "rps18"         "rps2"          "rps3"          "rps4"         
## [65] "rps7"          "rps8"          "ycf3"
#This is to identify metadata
method_cd <- read.csv("all_spgenetree.csv", h=F)
#Calculating the treespace
treespace_cd <- treespace(tree_cd, nf=66)
#head(treespace_cd)
#head(treespace_cd$pco$li)
plotGrovesD3(treespace_cd$pc, treeNames = tree_names_cd)
#Detecting the number of groups
#id_cluster_cd <- findGroves(treespace_cd)
#580
#y

id_cluster_cd <- findGroves(treespace_cd, nclust = 3)
plotGrovesD3(id_cluster_cd, treeNames = tree_names_cd)
#making a dataframe with the coordinates of the trees calculated by treespace(), cluster information, and rownmes 
#for names of trees; group is not necessary because I have already calculated the number of groups with findGroves()
df_cd <- data.frame(id_cluster_cd$treespace$pco$li, dataset=tree_names_cd, Methods=method_cd$V1, 
                    group=id_cluster_cd$groups, name=rownames(id_cluster_cd$treespace$pco$li))
#head(df_cd)
df_cd["sptree_vs_getree"] <- c("as", "sv", "c", "gp", "bs", NA, NA, NA, NA, NA, NA, NA, "matK", NA, NA, NA, NA, NA, "ndhF", NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, "rbcL", NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA)
#head(df_cd)
#change order to avoid overlapping
df2_cd_arranged <- arrange(df_cd, desc(Methods))
#head(df2_cd_arranged)
#We use the round() and sum() to verify the percentage explained by each axis in the ordination
round(treespace_cd$pco$eig/sum(treespace_cd$pco$eig),3)
##  [1] 0.521 0.086 0.061 0.047 0.031 0.027 0.022 0.021 0.019 0.017 0.015
## [12] 0.013 0.012 0.010 0.010 0.009 0.008 0.007 0.006 0.006 0.005 0.004
## [23] 0.004 0.004 0.003 0.003 0.003 0.003 0.002 0.002 0.002 0.002 0.002
## [34] 0.002 0.001 0.001 0.001 0.001 0.001 0.001 0.001 0.001 0.001 0.001
## [45] 0.001 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000
## [56] 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000
#Axis1 = 0.521, Axis2 = 0.086, Axis3 = 0.061, Axis4 = 0.047, Axis5 = 0.031
#Now we plot the graph
df2_cd_arranged$Method <- factor(df2_cd_arranged$Method, levels(df2_cd_arranged$Method)[c(1,3,6,5,4,2)])

plot_KC_cd <- ggplot(df2_cd_arranged, aes(x=A1, y=A2, color = Method, label = sptree_vs_getree)) +
  geom_point(aes(color=Method, shape=Method), alpha=.6, size=4) + 
  guides(size=F) + geom_hline(yintercept = 0, linetype="dashed") +
  geom_vline(xintercept = 0, linetype="dashed") +
  geom_text_repel(size=6, box.padding=2, point.padding = 0.8, show.legend=FALSE) +
  labs(x="PCoA1 (52.1%)", y=" PCoA2 (8.6%)") + 
  theme_classic() + ggtitle("Kendall-Colijn - codon") +
  theme(legend.text=element_text(size=12),legend.title=element_text(size=15), plot.title = element_text(hjust = 0.5)) +
  guides(colour = guide_legend(override.aes = list(size=5))) +
  theme(axis.title=element_text(size = 12,face="bold"), axis.text=element_text(size=10,face="bold"))  +
  scale_colour_manual(values = c("#000000", "#E69F00", "#CC79A7", "#009E73", "#F0E442", "#56B4E9"))+
  scale_shape_manual(values=c(17, 17, 19, 19, 19, 19))
plot_KC_cd
## Warning: Removed 59 rows containing missing values (geom_text_repel).

Now we prepared the final figures:

#GENE and AMINO ACIDS
#Now we combine both graphs using plot_grid()
#First combining both graphs without a legend
legend3 <- get_legend(plot_RF_ge + theme(legend.position="bottom"))
## Warning: Removed 59 rows containing missing values (geom_text_repel).
ge_aa_plots <- plot_grid(plot_RF_ge + theme(legend.position="none"),
                         plot_KC_ge + theme(legend.position="none"),
                         plot_RF_aa + theme(legend.position="none"),
                         plot_KC_aa + theme(legend.position="none"),
                         align = 'vh', labels = c("a)", "b)", "c)", "d)"), hjust = -1, nrow = 2) 
## Warning: Removed 59 rows containing missing values (geom_text_repel).

## Warning: Removed 59 rows containing missing values (geom_text_repel).

## Warning: Removed 59 rows containing missing values (geom_text_repel).

## Warning: Removed 59 rows containing missing values (geom_text_repel).
ge_aa_plots

#Now adding the legend
ge_aa_plots_onelegend <- plot_grid(ge_aa_plots, legend3, ncol=1, rel_heights = c(3, .5))
ge_aa_plots_onelegend

#Finally adding a title
title <- ggdraw() + draw_label("Gene Trees Landscape", fontface='bold')
ge_aa_plot <- plot_grid(title, ge_aa_plots_onelegend, ncol=1, rel_heights = c(0.1, 1))
#Saving pdf determining the size of the figure:
save_plot("./Figure6.pdf", ge_aa_plot, base_width=9, base_height=11)
#EXON and CODON (SI)
#Now we combine both graphs using plot_grid()
#First combining both graphs without a legend
legend4 <- get_legend(plot_RF_ex + theme(legend.position="bottom"))
## Warning: Removed 59 rows containing missing values (geom_text_repel).
ex_cd_plots <- plot_grid(plot_RF_ex + theme(legend.position="none"),
                         plot_KC_ex + theme(legend.position="none"),
                         plot_RF_cd + theme(legend.position="none"),
                         plot_KC_cd + theme(legend.position="none"),
                         align = 'vh', labels = c("a)", "b)", "c)", "d)"), hjust = -1, nrow = 2) 
## Warning: Removed 59 rows containing missing values (geom_text_repel).

## Warning: Removed 59 rows containing missing values (geom_text_repel).

## Warning: Removed 59 rows containing missing values (geom_text_repel).

## Warning: Removed 59 rows containing missing values (geom_text_repel).
ex_cd_plots

#Now adding the legend
ex_cd_plots_onelegend <- plot_grid(ex_cd_plots, legend4, ncol=1, rel_heights = c(3, .5))
ex_cd_plots_onelegend

#Finally adding a title
title <- ggdraw() + draw_label("Gene Tree Landscape", fontface='bold')
ex_cd_plot <- plot_grid(title, ex_cd_plots_onelegend, ncol=1, rel_heights=c(0.1, 1))
#Saving pdf determining the size of the figure:
save_plot("./SupFigureS20.pdf", ex_cd_plot, base_width=9, base_height=11)